Relational Association Rules: 
getting WARMeR 

Bart Goethals and Jan Van den Bussche 
University of Limburg, Belgium 

Abstract 

In recent years, the problem of association rule mining in transactional 
data has been well studied. We propose to extend the discovery of clas- 
sical association rules to the discovery of association rules of conjunctive 
queries in arbitrary relational data, inspired by the Warmr algorithm, 
developed by Dehaspe and Toivonen, that discovers association rules over 
a limited set of conjunctive queries. Conjunctive query evaluation in rela- 
tional databases is well understood, but still poses some great challenges 
when approached from a discovery viewpoint in which patterns are gen- 
erated and evaluated with respect to some well defined search space and 
pruning operators. 

1 Introduction 

In recent years, the problem of mining association rules over frequent itemsets in
transactional data has been well studied and has resulted in several algorithms
that can find association rules within a limited amount of time. More complex
patterns have also been considered, such as trees [17], graphs [10], or
arbitrary relational structures [6]. However, the presented algorithms only
work on databases consisting of a set of transactions. For example, in the tree
case [17], every transaction in the database is a separate tree, and the presented
algorithm tries to find all frequent subtrees occurring within all such
transactions. Nevertheless, many relational databases are not suited to be
converted into a transactional format, and even if this were possible, a lot of
information implicitly encoded in the relational model would be lost after conversion.
Towards the discovery of association rules in arbitrary relational databases,
Dehaspe and Toivonen developed an inductive logic programming algorithm,
Warmr [5], [6], that discovers association rules over a limited set of conjunctive
queries on transactional relational databases in which every transaction
consists of a small relational database itself. In this paper, we propose to extend
their framework to a broader range of conjunctive queries on arbitrary relational
databases.

Conjunctive query evaluation in relational databases is well understood, but
still poses some great challenges when approached from a discovery viewpoint in
which patterns are generated and evaluated with respect to some well-defined
search space and pruning operators. We describe the problems occurring in
this mining problem and present an algorithm that uses a two-phase
architecture similar to that of the standard association rule mining algorithm
over frequent itemsets (Apriori), which is also used in the Warmr algorithm. In
the first phase, all frequent patterns are generated, but now a pattern is a
conjunctive query and its support equals the number of distinct tuples in the
answer of the query. The second phase generates all association rules over these
patterns. Both phases are based on the general levelwise pattern mining
algorithm as described by Mannila and Toivonen [12].

In Section 2, we formally state the problem we try to solve. In Section 3,
we describe the general approach that is used for a large family of data
mining problems. In Section 4, we describe the Warmr algorithm, which is also
based on this general approach. In Section 5, we describe our approach as a
generalization of the Warmr algorithm and identify the algorithmic challenges
that need to be overcome. In Section 6, we show a sample run of the presented
approach. We conclude the paper in Section 7 with a brief discussion and future
work.



2 Problem statement 

The relational data model is based on the idea of representing data in tabular 
form. The schema of a relational database describes the names of the tables 
and their respective sets of column names, also called attributes. The actual
content of a database is called an instance for that schema. In order to retrieve
data from the database, several query languages have been developed, of which
SQL is the standard adopted by most database management system vendors.
An important and well-studied subset of SQL is the family of
conjunctive queries.

As already mentioned in the Introduction, current algorithms for the discovery
of patterns and rules have mainly focused on transactional databases. In practice,
these algorithms use several specialized data structures and indexing schemes 
to efficiently find their specific type of patterns, i.e., itemsets, trees, graphs, and 
many others. As an appropriate generalization of these kinds of patterns, we 
propose a framework for arbitrary relational databases in which a pattern is a 
conjunctive query. 

Assume we are given a relational database consisting of a schema R and an
instance I of R. An atomic formula over R is an expression of the form R(x),
where R is a relation name in the schema and x is a k-tuple of variables and
constants, with k the arity of R.

Definition 1. A conjunctive query Q over R consists of a head and a body. The 
body is a finite set of atomic formulas over R. The head is a tuple of variables 
occurring in the body. 

A valuation on Q is a function f that assigns a constant to every variable
in the query. A valuation is a matching of Q in I if, for every R(x) in the body
of Q, the tuple f(x) is in I(R). The answer of Q on I is the set

Q(I) := {f(y) | y is the head of Q and f is a matching of Q in I}.

We will write conjunctive queries using the commonly used Prolog notation. 
For example, consider the following query on a beer drinkers database: 

Q(x) :- likes(x, 'Duvel'), likes(x, 'Trappist').

The answer of this query consists of all drinkers that like Duvel and also like 
Trappist. 
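
To make the notions of matching and answer concrete, the following minimal
Python sketch evaluates a conjunctive query by backtracking over the atoms of
its body. The encoding is an assumption made for illustration only: an atom is
a pair (relation name, argument tuple), variables are strings starting with
'?', and the small instance `beers` is hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]          # (relation name, arguments)
Instance = Dict[str, Set[Tuple[str, ...]]]  # relation name -> set of tuples


def is_var(term: str) -> bool:
    # Assumed convention: variables start with '?', anything else is a constant.
    return term.startswith("?")


def matchings(body: List[Atom], inst: Instance,
              f: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Yield every valuation f that is a matching of the body in the instance."""
    f = {} if f is None else f
    if not body:
        yield dict(f)
        return
    (rel, args), rest = body[0], body[1:]
    for tup in inst.get(rel, set()):
        if len(tup) != len(args):
            continue
        g = dict(f)
        # A variable must keep the value it already has; a constant must match.
        if all((g.setdefault(t, v) == v) if is_var(t) else t == v
               for t, v in zip(args, tup)):
            yield from matchings(rest, inst, g)


def answer(head: Tuple[str, ...], body: List[Atom], inst: Instance):
    """The answer Q(I): the images f(y) of the head y under all matchings f."""
    return {tuple(f[v] for v in head) for f in matchings(body, inst)}


# Hypothetical instance of the beer drinkers database.
beers = {"likes": {("ann", "Duvel"), ("ann", "Trappist"), ("bob", "Duvel")}}
# Q(x) :- likes(x, 'Duvel'), likes(x, 'Trappist').
q_body = [("likes", ("?x", "Duvel")), ("likes", ("?x", "Trappist"))]
print(answer(("?x",), q_body, beers))  # {('ann',)}
```

Only ann likes both beers in this instance, so the answer contains a single
tuple.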

For two conjunctive queries Q1 and Q2 over R, we write Q1 ⊆ Q2 if for every
possible instance I of R, Q1(I) ⊆ Q2(I), and say that Q1 is contained in Q2. Q1
and Q2 are called equivalent if and only if Q1 ⊆ Q2 and Q2 ⊆ Q1. Note that
the question whether a conjunctive query is contained in another conjunctive
query is decidable [16].
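
The text does not spell out a decision procedure. One standard technique,
assumed here purely for illustration, is the homomorphism test: a query is
contained in another exactly when the atoms of the containing query can be
mapped into the atoms of the contained query by a substitution that fixes
constants and sends head to head. The sketch below uses a hypothetical
encoding (variables start with '?') and a plain backtracking search, so its
running time can be exponential.

```python
from typing import Dict, Tuple

Atom = Tuple[str, Tuple[str, ...]]
Query = Tuple[Tuple[str, ...], Tuple[Atom, ...]]   # (head, body)


def is_var(term: str) -> bool:
    # Assumed convention: variables start with '?', anything else is a constant.
    return term.startswith("?")


def contained_in(q1: Query, q2: Query) -> bool:
    """Return True if q1 is contained in q2 (homomorphism from q2 into q1)."""
    (head1, body1), (head2, body2) = q1, q2
    if len(head1) != len(head2):
        return False

    def extend(todo: Tuple[Atom, ...], h: Dict[str, str]) -> bool:
        if not todo:
            # All atoms of q2 are mapped; the heads must correspond as well.
            return tuple(h.get(t, t) for t in head2) == tuple(head1)
        (rel, args), rest = todo[0], todo[1:]
        for rel1, args1 in body1:
            if rel1 != rel or len(args1) != len(args):
                continue
            g = dict(h)
            # Variables of q2 map consistently; constants map to themselves.
            if all((g.setdefault(a, b) == b) if is_var(a) else a == b
                   for a, b in zip(args, args1)):
                if extend(rest, g):
                    return True
        return False

    return extend(tuple(body2), {})


# Q(x) :- likes(x, 'Duvel'), likes(x, 'Trappist').
q = (("?x",), (("likes", ("?x", "Duvel")), ("likes", ("?x", "Trappist"))))
# Q'(x) :- likes(x, 'Duvel').
q_general = (("?x",), (("likes", ("?x", "Duvel")),))
print(contained_in(q, q_general), contained_in(q_general, q))  # True False
```

Two queries are then equivalent when `contained_in` holds in both directions.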

Definition 2. The support of a conjunctive query Q in an instance I is the 
number of distinct tuples in the answer of Q on I. A query is called frequent in 
I if its support exceeds a given minimal support threshold. 

Definition 3. An association rule is of the form Q1 ⇒ Q2, such that Q1 and
Q2 are both conjunctive queries and Q2 ⊆ Q1. An association rule is called
frequent in I if Q2 is frequent in I and it is called confident if the support of Q2
divided by the support of Q1 exceeds a given minimal confidence threshold.

Example 1. Consider the following two queries: 

Q1(x, y) :- likes(x, 'Duvel'), visits(x, y).

Q2(x, y) :- likes(x, 'Duvel'), visits(x, y), serves(y, 'Duvel').

The rule Q1 ⇒ Q2 should then be read as follows: if a person x that likes Duvel
visits bar y, then bar y serves Duvel.
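
As a small worked instance of Definitions 2 and 3, the sketch below computes
the support and confidence of this rule on a hypothetical database. The
relation contents, the thresholds, and the helper names `support` and
`confidence` are assumptions chosen for illustration; the two queries are
evaluated directly with set comprehensions.

```python
# Hypothetical instance of the beer drinkers database.
likes = {("ann", "Duvel"), ("bob", "Duvel")}
visits = {("ann", "cafe1"), ("ann", "cafe2"), ("bob", "cafe1")}
serves = {("cafe1", "Duvel")}

# Q1(x, y) :- likes(x, 'Duvel'), visits(x, y).
q1 = {(x, y) for (x, beer) in likes if beer == "Duvel"
      for (x2, y) in visits if x2 == x}
# Q2(x, y) :- likes(x, 'Duvel'), visits(x, y), serves(y, 'Duvel').
q2 = {(x, y) for (x, y) in q1 if (y, "Duvel") in serves}


def support(answer: set) -> int:
    """Number of distinct tuples in the answer of a query."""
    return len(answer)


def confidence(antecedent: set, consequent: set) -> float:
    """Support of the consequent divided by the support of the antecedent."""
    return support(consequent) / support(antecedent)


minsup, minconf = 1, 0.5   # assumed thresholds
is_frequent = support(q2) > minsup
is_confident = confidence(q1, q2) > minconf
print(support(q1), support(q2), is_frequent, is_confident)  # 3 2 True True
```

Here two of the three visits by a Duvel drinker go to a bar that serves Duvel,
so the confidence is 2/3.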

A natural question to ask is why we should only consider rules over queries 
that are contained for any possible instance. For example, assume we have the 
following two queries: 

Q1(y) :- likes(x, 'Duvel'), visits(x, y).
Q2(y) :- serves(y, 'Duvel').

Obviously, Q2 is not contained in Q1 and vice versa. Nevertheless, it is still
possible that for a given instance I, we have Q2(I) ⊆ Q1(I), and hence this
could make an interesting association rule Q1 ⇒ Q2, which should be read as
follows: if bar y has a visitor that likes Duvel, then bar y also serves Duvel.

Proposition 1. Every association rule Q1 ⇒ Q2, such that Q2(I) ⊆ Q1(I),
can be expressed by an association rule Q1 ⇒ Q2', with Q2' = Q2 ∩ Q1, and
essentially has the same meaning.

In this case the correct rule would be Q1 ⇒ Q2, with

Q1(y) :- likes(x, 'Duvel'), visits(x, y).

Q2(y) :- likes(x, 'Duvel'), visits(x, y), serves(y, 'Duvel').

Note the resemblance with the queries used in Example 1. The bodies of the 
queries are the same, but now we have another head. Evidently, different heads 
result in a different meaning of the corresponding association rule which can 
still be interesting. As another example, note the difference with the following 
two queries: 

Q1(x) :- likes(x, 'Duvel'), visits(x, y).

Q2(x) :- likes(x, 'Duvel'), visits(x, y), serves(y, 'Duvel').

The rule Q1 ⇒ Q2 should then be read as follows: if a person x that likes Duvel
visits a bar, then x also visits a bar that serves Duvel.

The goal is now to find all frequent and confident association rules in the 
given database. 

3 General approach 

As already mentioned in the introduction, most association rule mining
algorithms use the common two-phase architecture. Phase 1 generates all frequent
patterns, and phase 2 generates all frequent and confident association rules.

The algorithms used in both phases are based on the general levelwise
pattern mining algorithm as described by Mannila and Toivonen [12]. Given a
database D, a class of patterns C, and a selection predicate q, the algorithm
finds the "theory" of D with respect to C and q, i.e., the set
Th(C, D, q) := {φ ∈ C | q(D, φ) is true}. The selection predicate q is used for
evaluating whether a pattern φ ∈ C defines a (potentially) interesting pattern
in D. The main problem this algorithm tries to tackle is to minimize the number
of patterns that need to be evaluated by q, since it is assumed this evaluation
is the most costly operation of such mining algorithms. The algorithm is based
on a breadth-first search in the search space spanned by a specialization
relation, which is a partial order ≼ on the patterns in C. We say that φ is more
specific than ψ, or ψ is more general than φ, if φ ≼ ψ. The relation ≼ is a
monotone specialization relation with respect to q if the selection predicate q
is monotone with respect to ≼, i.e., for all D and φ, we have the following: if
q(D, φ) and φ ≼ γ, then q(D, γ). In what follows, we assume that ≼ is a monotone
specialization relation. We write φ ≺ ψ if φ ≼ ψ and not ψ ≼ φ. The algorithm
works iteratively, alternating between candidate generation and candidate
evaluation, as follows.

C_1 := {φ ∈ C | there is no γ in C such that φ ≺ γ};
i := 1;
while C_i ≠ ∅ do
    // Candidate evaluation
    F_i := {φ ∈ C_i | q(D, φ)};
    // Candidate generation
    C_{i+1} := {φ ∈ C | for all γ such that φ ≺ γ, we have γ ∈ ⋃_{j≤i} F_j} \ ⋃_{j≤i} C_j;
    i := i + 1
end while
return ⋃_{j<i} F_j

In the generation step of iteration i, a collection C_{i+1} of new candidate patterns
is generated, using the information available from the more general patterns in
⋃_{j≤i} F_j, which have already been evaluated. Then, the selection predicate is
evaluated on these candidate patterns. The collection F_{i+1} will consist of those
patterns in C_{i+1} that satisfy the selection predicate q. The algorithm starts
by constructing C_1 to contain all most general patterns. The iteration stops
when no more potentially interesting patterns can be found with respect to the
selection predicate.
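
As an illustration, the sketch below instantiates the levelwise algorithm for
the familiar case of frequent itemsets. This instantiation is an assumption
made for the example, not the query setting of this paper: a pattern is an
itemset, an itemset is more specific than each of its subsets, and the
selection predicate tests whether the support exceeds `minsup`. Because all
candidates of one level have the same size, it suffices to check the direct
generalizations (one item removed) against F_i.

```python
from typing import FrozenSet, Iterable, List, Set


def levelwise(transactions: List[Set[str]], items: Iterable[str],
              minsup: int) -> Set[FrozenSet[str]]:
    """Return all itemsets whose support exceeds minsup, level by level."""
    items = set(items)

    def q(pattern: FrozenSet[str]) -> bool:
        # Selection predicate: the support must exceed the threshold.
        return sum(pattern <= t for t in transactions) > minsup

    candidates = {frozenset()}       # C_1: the single most general pattern
    theory: Set[FrozenSet[str]] = set()
    while candidates:
        frequent = {p for p in candidates if q(p)}        # F_i
        theory |= frequent
        # C_{i+1}: extend each frequent pattern by one item and keep the result
        # only if every direct generalization (one item removed) is in F_i.
        candidates = {
            p | {a}
            for p in frequent
            for a in items - p
            if all((p | {a}) - {b} in frequent for b in p | {a})
        }
    return theory


transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}]
result = levelwise(transactions, "abc", minsup=1)
print(sorted("".join(sorted(p)) for p in result))  # ['', 'a', 'ab', 'b']
```

The itemset {a, c} is never evaluated, because its generalization {c} is
already known to be infrequent; this is the pruning the levelwise search
relies on.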

In general, given a language C from which patterns are chosen, a selection
predicate q and a monotone specialization relation ≼ with respect to q, this
algorithm poses several challenges.

1. An initial set C_1 of most general candidate patterns needs to be identified,
which is not always possible for infinite languages, and hence other, possibly
less optimal solutions could be required.

2. Given all patterns ⋃_{j≤i} F_j that satisfy the selection predicate up to a
certain level i, the set C_{i+1} of all candidate patterns must be generated
efficiently. It might be impossible to generate exactly the elements
of C_{i+1}; instead, it might be necessary to generate a superset of C_{i+1},
after which the non-candidate patterns must be identified and removed.
Even if this identification is efficient, naively generating all possible
patterns could still become infeasible if this number of patterns becomes too
large. Hence, this poses two additional challenges:

(a) efficiently generate the smallest possible superset of C_{i+1}, and

(b) identify and remove each generated pattern that is not a candidate
pattern by efficiently checking whether all of its generalizations are in
⋃_{j≤i} F_j.

3. Extract all patterns from C_{i+1} that satisfy the selection predicate q, by
efficiently evaluating q on all elements in C_{i+1}.

In the next section, we identify these challenges for both phases of the as- 
sociation rule mining problem within the framework proposed by Dehaspe and 
Toivonen, and describe their solutions as implemented within the Warmr al- 
gorithm. 

4 The Warmr algorithm



As already mentioned in the introduction, a first approach towards the goal of
discovering all frequent and confident association rules in arbitrary relational
databases has been presented by Dehaspe and Toivonen, in the form of an
inductive logic programming algorithm, Warmr [5], [6], that discovers association
rules over a limited set of conjunctive queries.

4.1 Phase 1 

The procedure to generate all frequent conjunctive queries is primarily based
on a declarative language bias to constrain the search space to a subset of all
conjunctive queries, which is an extensively studied subfield in ILP [13].

The declarative language bias used in Warmr drastically simplifies the 
search space of all queries by using the Warmode formalism. This formalism
imposes two major constraints. The most important constraint is the key
constraint. This constraint requires the specification of a single key atomic
formula, which is obligatory in all queries. This key atomic formula also determines
what is counted, i.e., it determines the head of the query, which consists of all
variables occurring in the key atom. Second, it requires a list Atoms of all
atomic formulas that are allowed in the queries that will be generated. In the most general
case, this list consists of the relation names in the database schema R. If one 
also wants to allow certain constants within the atomic formulas, then these 
atomic formulas must be specified for every such constant. In the most gen- 
eral case, the complete database instance must also be added to the Atoms list. 
The Warmode formalism also allows other constraints, but since these are not 
obligatory, we will not discuss them any further. 

Example 2. Consider 

Atoms := {likes(_, 'Duvel'),
          likes(_, 'Trappist'),
          serves(_, 'Duvel'),
          serves(_, 'Trappist')},

where _ stands for an arbitrary variable, and

key := visits(_, _).

Then,

C = {Q(x1, x2) :- visits(x1, x2), likes(x3, 'Duvel').
     Q(x1, x2) :- visits(x1, x2), likes(x1, 'Duvel').
     Q(x1, x2) :- visits(x1, x2), serves(x3, 'Duvel').
     Q(x1, x2) :- visits(x1, x2), serves(x2, 'Duvel').
     Q(x1, x2) :- visits(x1, x2), likes(x1, 'Duvel'), serves(x2, 'Duvel').
     Q(x1, x2) :- visits(x1, x2), likes(x1, 'Duvel'), serves(x2, 'Trappist').
     ...}.



As can be seen, these constraints already dismiss a lot of interesting patterns. 
However, it is still possible to discover all frequent conjunctive queries, but then, 
we need to run the algorithm for every possible key atomic formula with the 
least restrictive declarative language bias. Of course, using this strategy, a lot 
of possible optimizations are left out, as will be shown in the next section. 

The specialization relation used in Warmr is defined as Q1 ≼ Q2 if Q1 ⊆ Q2.
The selection predicate q is the minimal support threshold, which is indeed
monotone with respect to ≼, i.e., for every instance I and conjunctive queries
Q1 and Q2, we have the following: if Q1 is frequent and Q1 ⊆ Q2, then Q2 is
frequent.

Candidate generation In essence, the Warmr algorithm generates all
conjunctive queries contained in the query Q(x) :- R(x), where R(x) is the key
atomic formula. We call this query the key conjunctive query. Hence, the
key conjunctive query is the (single) most general pattern in C_1. Assume we are
given all frequent patterns up to a certain level i, ⋃_{j≤i} F_j. Then, Warmr
generates a superset of all candidate patterns by adding a single atomic formula,
from Atoms, to every query in F_i, as allowed by the Warmode declarations. From
this set, every candidate pattern needs to be identified by checking whether all of
its generalizations are frequent. However, this is no longer possible, since some
of these generalizations might not be in the language of admissible patterns.
Therefore, only those generalizations that satisfy the declarative language bias
need to be known to be frequent. In order to do this, for each generated query Q,
Warmr scans all infrequent conjunctive queries for one that is more general
than Q. However, this does not imply that all queries that are more general
than Q are known to be frequent! Indeed, consider the following example, which
is based on the declarative language bias from the previous example.

Example 3.

Q1(x1, x2) :- visits(x1, x2), likes(x1, 'Duvel').
Q2(x1, x2) :- visits(x1, x2), likes(x3, 'Duvel').

Both queries are single extensions of the key conjunctive query, and hence, they
are generated within the same iteration. Obviously, Q2 is more general than
Q1, but still, both queries remain in the set of candidate queries. Moreover, it
is necessary that both queries remain admissible, in order to guarantee that all
frequent conjunctive queries are generated.

This example shows that the candidate generation step of Warmr does 
not comply with the general levelwise framework given in the previous section. 
Indeed, at a certain iteration, it generates patterns of different levels in the 
search space spanned by the containment relation. 

The generation strategy also generates several queries that are equivalent 
with other candidate queries, or with queries already generated in previous iter- 
ations, which also need to be identified and removed from the set of candidate 
patterns. Again, for each candidate query, all other candidate queries and all 
frequent queries are scanned for an equivalent query. Unfortunately, the
question whether two conjunctive queries are equivalent is an NP-complete problem.
Note that isomorphic queries are definitely equivalent (but not vice versa in
general), and that the problem of efficiently generating finite structures up to
isomorphism, or testing isomorphism of two given finite structures efficiently, is
also still an open problem.

Candidate evaluation Since Warmr is an inductive logic programming al- 
gorithm written within a logic programming environment, the evaluation of all 
candidate queries is performed inefficiently. Still, Warmr uses several optimiza- 
tions to increase the performance of this evaluation step, but these optimizations 
can hardly be compared to the optimized query processing capabilities of rela- 
tional database systems. 

4.2 Phase 2 

The procedure to generate all association rules in Warmr simply consists of
finding all pairs (Q1, Q2) in the list of frequent queries, such that Q2 is
contained in Q1. We were unable to find out exactly how this procedure works,
that is, how each query Q2 is found, given query Q1. In any case, this phase
is generally less of an efficiency issue, since the supports of all queries that
need to be considered are already known.

5 Getting WARMeR 

Inspired by the framework of Warmr, we present in this section a more general
framework and investigate the efficiency challenges described in Section 3. More
specifically, we want to discover association rules over all conjunctive queries
instead of only those queries contained in a given key conjunctive query, since
it might not always be clear what exactly needs to be counted. For example, in
the beer drinkers database, the examples given in Section 2 show that different
heads could lead to several interesting association rules about the drinkers,
the bars or the beers separately. We also want to exploit the containment
relationship of conjunctive queries as much as possible, and avoid situations such
as described in Example 3. Indeed, the Warmr algorithm does not fully exploit
the different levels induced by the containment relationship, since it generates
several candidate patterns of different levels within the same iteration.

5.1 Phase 1 

The goal of this first phase is to find all frequent conjunctive queries. Hence, C 
is the family of all conjunctive queries. 

Since only the number of different tuples in the answer of a query is impor- 
tant and not the content of the answer itself, we will extend the notion of query 
containment, such that it can be better exploited in the levelwise algorithm. 

Definition 4. A conjunctive query Q1 is diagonally contained in Q2 if Q1 is
contained in a projection of Q2. We write Q1 ⊆^Δ Q2.

Example 4. 

Q1(x) :- likes(x, y), visits(x, z), serves(z, y)
Q2(x, z) :- likes(x, y), visits(x, z), serves(z, y)

The answer of Q1 consists of all drinkers that visit at least one bar that serves at
least one beer they like. The answer of Q2 consists of all visits of a drinker to
a bar if that bar serves at least one beer the drinker likes. Obviously, a drinker
could visit multiple bars that serve a beer they like, and hence all these bars
will be in the answer of Q2 together with that drinker, while Q1 only gives the
name of that drinker. Hence, the number of tuples in the answer of Q1 will
always be less than or equal to the number of tuples in the answer of Q2.
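
The inequality between the two supports can be checked on a small instance.
The sketch below uses hypothetical relation contents and evaluates both
queries with set comprehensions; since the two queries share their body, the
answer of the first is simply the projection of the answer of the second.

```python
# Hypothetical instance of the beer drinkers database.
likes = {("ann", "Duvel"), ("bob", "Orval")}
visits = {("ann", "cafe1"), ("ann", "cafe2"), ("bob", "cafe1")}
serves = {("cafe1", "Duvel"), ("cafe2", "Duvel")}

# Q2(x, z) :- likes(x, y), visits(x, z), serves(z, y)
q2 = {(x, z) for (x, y) in likes
      for (x2, z) in visits if x2 == x and (z, y) in serves}
# Q1(x) has the same body, so its answer is the projection of Q2 on x.
q1 = {(x,) for (x, z) in q2}

print(sorted(q1))  # [('ann',)]
print(sorted(q2))  # [('ann', 'cafe1'), ('ann', 'cafe2')]
```

Projecting can merge tuples but never creates new ones, which is why the
support can only stay equal or decrease.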

We now define Q1 ≼ Q2 if Q1 ⊆^Δ Q2. The selection predicate q is the
minimal support threshold, which is indeed monotone with respect to ≼, i.e.,
for every instance I and conjunctive queries Q1 and Q2, we have the following:
if Q1 is frequent and Q1 ⊆^Δ Q2, then Q2 is frequent. Notice that the notion of
diagonal containment now allows the incorporation of conjunctive queries with
different heads within the search space spanned by this specialization relation.

Two issues remain to be solved: how are the candidate queries efficiently 
generated without generating two equivalent queries? and how is the frequency 
of each candidate query efficiently computed? 

Candidate generation As a first optimization towards the generation of all 
conjunctive queries, we will already prune several queries in advance. 

1. The head of a query must contain at least one variable, since the support
of a query with an empty head can be at most 1. Hence, we already know
its support after we evaluate a query with the same body but a nonempty
head.

2. We allow only a single permutation of the head, since the supports of
queries with an equal body but different permutations of the head are
equal.

Generating candidate conjunctive queries using the levelwise algorithm
requires an initial set of all most general queries with respect to ⊆^Δ. However,
such queries do not exist. Indeed, for every conjunctive query Q, we can
construct another conjunctive query Q', such that Q ⊆^Δ Q', by simply adding a
new atomic formula with new variables into the body of Q, and adding these
variables to the head. A rather drastic but still reasonable solution to this
problem is to limit the search space a priori to conjunctive queries with at most a
fixed number of atomic formulas in the body. Then, within this space, we can
look at the set of most general queries, and this set is now well-defined.

At every iteration in the levelwise algorithm we need to generate all candi- 
date conjunctive queries up to equivalence, such that all of their generalizations 
are known to be frequent. Since an algorithm to generate exactly this set is 
not known, we will generate a small superset of all candidates and afterwards 
remove each query of which a generalization is not known to be frequent (or 
known to be infrequent). 

Nevertheless, any candidate conjunctive query is always more specific than
at least one query in F_i. Hence, we can generate a superset of all possible
candidate queries using the following four operations on each query in F_i.

Extension: We add a new atomic formula with new variables to the body. 

Join: We replace all occurrences of a variable with another variable already 
occurring in the query. 

Selection: We replace all occurrences of a variable x with some constant. 

Projection: We remove a variable from the head if this does not result in an 
empty head. 

Example 5. This example shows a single application of each operation on the 
query 

Q(x, y) :- likes(x, y), visits(x, z), serves(z, u). 

Extension: 

Q(x, y) :- likes(x, y), visits(x, z), serves(z, u), likes(v, w). 

Join: 

Q(x, y) :- likes(x, y), visits(x, z), serves(z, y). 

Selection: 

Q(x, y) :- likes(x, y), visits(x, z), serves(z, 'Duvel').

Projection: 

Q(x) :- likes(x, y), visits(x, z), serves(z, u).
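
The four operations are simple syntactic transformations. The following Python
sketch shows one possible encoding, assumed for illustration only: a query is
a pair (head, body), atoms are pairs (relation name, argument tuple), and
variables are strings starting with '?'. The function names mirror the
operation names; the caller is responsible for supplying variable names that
are really new in `extension`.

```python
from typing import Tuple

Atom = Tuple[str, Tuple[str, ...]]
Query = Tuple[Tuple[str, ...], Tuple[Atom, ...]]   # (head, body)


def substitute(q: Query, old: str, new: str) -> Query:
    """Replace all occurrences of the term `old` by `new` in head and body."""
    head, body = q
    swap = lambda t: new if t == old else t
    return (tuple(swap(t) for t in head),
            tuple((rel, tuple(swap(t) for t in args)) for rel, args in body))


def extension(q: Query, rel: str, fresh_vars: Tuple[str, ...]) -> Query:
    """Add an atom over new variables; the caller must supply unused names."""
    head, body = q
    return head, body + ((rel, tuple(fresh_vars)),)


def join(q: Query, x: str, y: str) -> Query:
    """Replace variable x by the variable y that already occurs in q."""
    return substitute(q, x, y)


def selection(q: Query, x: str, constant: str) -> Query:
    """Replace variable x by a constant."""
    return substitute(q, x, constant)


def projection(q: Query, x: str) -> Query:
    """Remove variable x from the head, unless the head would become empty."""
    head, body = q
    new_head = tuple(v for v in head if v != x)
    if not new_head:
        raise ValueError("projection would result in an empty head")
    return new_head, body


# Q(x, y) :- likes(x, y), visits(x, z), serves(z, u).
q = (("?x", "?y"), (("likes", ("?x", "?y")), ("visits", ("?x", "?z")),
                    ("serves", ("?z", "?u"))))
print(join(q, "?u", "?y")[1][-1])          # ('serves', ('?z', '?y'))
print(selection(q, "?u", "Duvel")[1][-1])  # ('serves', ('?z', 'Duvel'))
print(projection(q, "?y")[0])              # ('?x',)
```

The three printed results correspond to the join, selection and projection
cases of the example above.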

The following proposition implies that if we apply a sequence of these four
operations on the current set of frequent conjunctive queries, we indeed get at
least all candidate queries.

Proposition 2. Q1 ⊆^Δ Q2 if and only if a query equivalent to Q1 can be
obtained from Q2 by applying some finite sequence of extension, join, selection
and projection operations.

Nevertheless, using these operations, several equivalent or redundant queries 
can be generated. An efficient algorithm avoiding the generation of equivalent 
queries is still unknown. Hence, whenever we generate a candidate query, we 
need to test whether it is equivalent with another query we already generated. 
In order to keep the generated superset of all candidate conjunctive queries 
as small as possible, we apply an operator once on each query. If the query 
is redundant or equivalent with a previously generated query, we repeatedly 
apply any of the operators until a query is found that is not equivalent with 
a previously generated query. As already mentioned in the previous section, 
testing equivalence cannot be done efficiently. 

After generating this superset of all candidate conjunctive queries, we need 
to check for each of them whether all more general conjunctive queries are 
known to be frequent. This can be done by performing the inverses of the 
four operations extension, join, selection and projection, as described above. 
Even if we now assume that in the set of all frequent conjunctive queries there 
exist no two equivalent queries, we still need to find the query equivalent to 
the one generated using the inverse operations. Hence, the challenge of testing 
equivalence of two conjunctive queries reappears. 

Candidate evaluation After generating all candidate conjunctive queries,
we need to test which of them are frequent. This can be done by simply
evaluating every candidate query on the database, one at a time, by translating
each query to SQL. Although conjunctive query evaluation in relational databases
is well understood and several efficient algorithms have been developed (i.e.,
join query optimisation and processing), this remains a costly operation. Within
database research, a lot of work has been done on multi-query
optimization [15]. Here, one tries to efficiently evaluate multiple queries at once.
Unfortunately, these techniques are not yet implemented in most common database
systems.
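
The translation of a single candidate query to SQL can be sketched as follows.
This is an illustrative sketch, not the implementation used in this paper:
the function name `support_sql`, the column names in `schema`, the '?' prefix
for variables and the table contents are all assumptions. Each atom becomes a
table alias, repeated variables become join conditions, constants become
selection conditions, and the support is obtained by counting the distinct
head tuples. The sketch is run with Python's standard `sqlite3` module.

```python
import sqlite3
from typing import Dict, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]


def support_sql(head: Tuple[str, ...], body: List[Atom],
                schema: Dict[str, Tuple[str, ...]]) -> Tuple[str, List[str]]:
    """Build a SQL statement (plus parameters) that counts distinct answers."""
    first: Dict[str, str] = {}      # variable -> first column it is bound to
    where: List[str] = []
    params: List[str] = []
    for i, (rel, args) in enumerate(body):
        for col, term in zip(schema[rel], args):
            ref = f"t{i}.{col}"
            if not term.startswith("?"):     # constant: selection condition
                where.append(f"{ref} = ?")
                params.append(term)
            elif term in first:              # repeated variable: join condition
                where.append(f"{ref} = {first[term]}")
            else:
                first[term] = ref
    tables = ", ".join(f"{rel} AS t{i}" for i, (rel, _) in enumerate(body))
    columns = ", ".join(f"{first[v]} AS h{k}" for k, v in enumerate(head))
    inner = f"SELECT DISTINCT {columns} FROM {tables}"
    if where:
        inner += " WHERE " + " AND ".join(where)
    return f"SELECT COUNT(*) FROM ({inner})", params


schema = {"likes": ("drinker", "beer"), "visits": ("drinker", "bar"),
          "serves": ("bar", "beer")}     # assumed column names
conn = sqlite3.connect(":memory:")
for rel, cols in schema.items():
    conn.execute(f"CREATE TABLE {rel} ({', '.join(cols)})")
conn.executemany("INSERT INTO likes VALUES (?, ?)",
                 [("ann", "Duvel"), ("bob", "Duvel")])
conn.executemany("INSERT INTO visits VALUES (?, ?)",
                 [("ann", "cafe1"), ("ann", "cafe2"), ("bob", "cafe1")])
conn.executemany("INSERT INTO serves VALUES (?, ?)", [("cafe1", "Duvel")])

# Q(x, y) :- likes(x, 'Duvel'), visits(x, y), serves(y, 'Duvel').
body = [("likes", ("?x", "Duvel")), ("visits", ("?x", "?y")),
        ("serves", ("?y", "Duvel"))]
sql, params = support_sql(("?x", "?y"), body, schema)
print(conn.execute(sql, params).fetchone()[0])  # 2
```

Relation and column names are interpolated directly into the statement, so the
sketch assumes they come from a trusted schema; only constants are passed as
parameters.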

As a first optimization towards query evaluation, we can already derive the 
support of a significant part of all candidate conjunctive queries. Therefore, we 
only consider those candidate queries that satisfy the following restrictions. 

1. We only consider queries that have no constants in the head, because the 
support of such queries is equal to the support of those queries in which 
the constant is not in the head. 

2. We only consider queries that contain no duplicate variables in the head, 
since the support of such a query is equal to the support of the query 
without duplicates in the head. 

As another optimization, given a query involving constants, we will not treat
every variation of that query that uses different constants as a separate query,
but rather we can evaluate all those variations in a single global query. For
example, suppose the query

Q(x1) :- R(x1, x2)

is frequent. From this query, a lot of candidate queries are generated using the
selection operation on x2. Assume the active domain of x2 is 1, 2, ..., n; then
the set of candidate queries contains at least

{Q(x1) :- R(x1, 1), Q(x1) :- R(x1, 2), ..., Q(x1) :- R(x1, n)},

resulting in a possibly huge number of queries that need to be evaluated.
However, the support of all these queries can be computed by evaluating only the
single SQL query

select x2, count(*)
from R
group by x2
having count(*) > minsup

of which the answer consists of every constant c for x2 whose query is frequent,
together with the support of the corresponding query Q(x1) :- R(x1, c). From now
on, we will therefore use only a symbolic constant to denote all possible
selections of a given variable. For example, Q(x1) :- R(x1, c1) denotes the set of
all possible selections for x2 in the previous example. A query with such a
symbolic constant is then considered frequent if it is frequent for at least one
constant.
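
The single grouped query can be tried out with Python's standard `sqlite3`
module. The table contents and the threshold below are hypothetical; as in the
query above, count(*) equals the support only because R is a relation without
duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (x1, x2)")
# Hypothetical content of R; no duplicate rows, as in a relation.
conn.executemany("INSERT INTO R VALUES (?, ?)",
                 [("a", 1), ("b", 1), ("c", 2), ("a", 3), ("b", 3), ("c", 3)])

minsup = 1
rows = conn.execute(
    "SELECT x2, COUNT(*) FROM R GROUP BY x2 HAVING COUNT(*) > ?", (minsup,)
).fetchall()
supports = dict(rows)   # constant c -> support of Q(x1) :- R(x1, c)
print(sorted(supports.items()))  # [(1, 2), (3, 3)]
```

One scan of R thus replaces three separate selection queries, and the
infrequent constant 2 is filtered out by the having clause.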

As can be seen, several optimizations can be used to improve the performance
of the evaluation step in our algorithm. Also, we might be able to use some of
the techniques that have been developed for frequent itemset mining, such as
closed frequent itemsets, free sets and non-derivable itemsets. These
techniques could then be used to minimize the number of candidate queries that
need to be executed on the database: instead of evaluating such queries, we
might be able to compute their supports based on the supports of previously
evaluated queries. Another interesting optimization could be to avoid using SQL
queries completely, and instead use a more intelligent counting mechanism that
needs to scan the database or the materialized tables only once, and counts the
supports of all queries at the same time.

5.2 Phase 2 

The goal of the second phase is to find, for every frequent conjunctive query
Q, all confident association rules Q' ⇒ Q. Hence, we need to run the general
levelwise algorithm separately for every frequent query. That is, for any given
Q, C consists of all conjunctive queries Q', such that Q ⊆ Q'. Assume we are
given two association rules AR1 : Q1 ⇒ Q2 and AR2 : Q3 ⇒ Q4; we define
AR1 ≼ AR2 if Q3 ⊆ Q1 and Q2 ⊆ Q4.

Figure 1: Instance of the beer drinkers database.



The selection predicate q is the minimal confidence threshold which is again
monotone with respect to ≼, i.e., for every instance I and association rules
AR1 : Q1 ⇒ Q2 and AR2 : Q3 ⇒ Q4, we have the following: if AR1 is frequent
and confident and AR1 ≼ AR2, then AR2 is frequent and confident.
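
The monotonicity claim follows directly from the definitions. Writing supp(Q)
for the support of Q in I and conf for the confidence of a rule (shorthands
introduced here only for this derivation), containment of queries implies the
corresponding inequality of supports, so that:

```latex
% AR_1 : Q_1 \Rightarrow Q_2, \quad AR_2 : Q_3 \Rightarrow Q_4, \quad AR_1 \preceq AR_2
\begin{align*}
Q_3 \subseteq Q_1 &\;\Longrightarrow\; \mathrm{supp}(Q_3) \le \mathrm{supp}(Q_1), \\
Q_2 \subseteq Q_4 &\;\Longrightarrow\; \mathrm{supp}(Q_2) \le \mathrm{supp}(Q_4), \\
\mathrm{conf}(AR_2) = \frac{\mathrm{supp}(Q_4)}{\mathrm{supp}(Q_3)}
  &\;\ge\; \frac{\mathrm{supp}(Q_2)}{\mathrm{supp}(Q_1)} = \mathrm{conf}(AR_1).
\end{align*}
```

The denominators are nonzero: Q2 is frequent, and both supp(Q1) and supp(Q3)
are at least supp(Q2), because Q2 ⊆ Q1 and Q2 ⊆ Q4 ⊆ Q3. The second line also
shows that Q4 is frequent whenever Q2 is, so AR2 is frequent as well.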

Here, only a single issue remains to be solved: how are the candidate queries 
efficiently generated without generating two equivalent queries? 

We have to generate, for every frequent conjunctive query Q, all conjunctive
queries Q', such that Q ⊆ Q', and minimize the generation of equivalent queries.
In order to do this, we can use three of the four inverse operations described for 
the previous phase, i.e., the inverse extension, inverse join and inverse selection 
operations. We do not need to use the inverse projection operation since we do 
not want those queries that are diagonally contained in Q, but only those queries 
that are regularly contained in Q as defined in Section 2. Still, several queries 
will be generated which are equivalent with previously generated queries, and 
hence this should again be tested. 
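
One standard way to carry out such an equivalence test is the classical containment-mapping (homomorphism) criterion for conjunctive queries: $Q_1$ is contained in $Q_2$ exactly when the variables of $Q_2$ can be mapped to terms of $Q_1$ so that the head of $Q_2$ becomes the head of $Q_1$ and every body atom of $Q_2$ becomes a body atom of $Q_1$. The sketch below is a brute-force version for regular containment only. It is not the procedure used in this paper, and the query representation (a head tuple plus a list of `(predicate, arguments)` atoms, with constants written in single quotes) and the names `contained_in` and `equivalent` are assumptions made for illustration.

```python
from itertools import product

def is_const(term):
    # Assumed convention: constants are quoted, e.g. "'Duvel'"; all else is a variable.
    return term.startswith("'")

def terms_of(head, body):
    """All variables and constants occurring in a query."""
    found = set(head)
    for _, args in body:
        found.update(args)
    return found

def contained_in(q1, q2):
    """True if q1 is contained in q2, i.e. a containment mapping q2 -> q1 exists."""
    (head1, body1), (head2, body2) = q1, q2
    atoms1 = set(body1)
    variables = sorted(t for t in terms_of(head2, body2) if not is_const(t))
    targets = sorted(terms_of(head1, body1))
    # Brute force: try every assignment of q2's variables to terms of q1.
    for image in product(targets, repeat=len(variables)):
        mapping = dict(zip(variables, image))

        def rename(term):
            return term if is_const(term) else mapping[term]

        if tuple(rename(t) for t in head2) != tuple(head1):
            continue  # heads must coincide after renaming
        if all((pred, tuple(rename(t) for t in args)) in atoms1
               for pred, args in body2):
            return True
    return False

def equivalent(q1, q2):
    """Two queries are equivalent if each is contained in the other."""
    return contained_in(q1, q2) and contained_in(q2, q1)

# A query with a redundant atom and its minimized form.
with_redundant_atom = (("x1", "x2"),
                       [("likes", ("x1", "x2")), ("likes", ("x1", "x3"))])
minimized = (("x1", "x2"), [("likes", ("x1", "x2"))])

# A query with a selection on a constant is contained in, but not
# equivalent to, the same query without the selection.
q_any = (("x1",), [("likes", ("x1", "x2"))])
q_duvel = (("x1",), [("likes", ("x1", "'Duvel'"))])
```

The search is exponential in the number of variables, which is unavoidable in general since conjunctive query containment is NP-complete; for the small queries produced under a bound on the number of atoms this is usually tolerable.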



6 Sample run 

Suppose we are given an instance of the beer drinkers database used throughout
this paper, as shown in Figure 1.

We now show a small part of an example run of the algorithm presented in
the previous section. In the first phase, all frequent conjunctive queries need to
be found, starting from the most general conjunctive queries. Let the maximum
number of atoms in the body of the query be limited to 2, and let the minimal
support threshold be 2, i.e., at least 2 tuples are needed in the output of a query
for it to be considered frequent. Then, the initial set of candidate queries $C_1$
consists of the 6 queries shown in Figure 2. Obviously, the support of each of
these queries is 36, and hence, $F_1 = C_1$. To generate all candidate conjunctive
queries for level 2, we need to apply the four specialization operations to each
of these 6 queries. Obviously, the extension operation is not yet allowed, since
this would result in conjunctive queries with 3 atoms in their bodies. We can
apply the join operation on $Q_1$, resulting in queries $Q_7$ and $Q_8$, as shown
in Figure 3. Similarly, the join operation can be applied to $Q_4$ and $Q_6$,
resulting in $Q_9$, $Q_{10}$ and $Q_{11}$, $Q_{12}$ respectively. However, the join
operation is not allowed on $Q_2$, $Q_3$ and

Figure 2: Level 1.



$Q_7(x_1, x_2, x_3)$ :- $likes(x_1, x_2), likes(x_1, x_3)$
$Q_8(x_1, x_2, x_3)$ :- $likes(x_1, x_2), likes(x_2, x_3)$
$Q_9(x_1, x_2, x_3)$ :- $visits(x_1, x_2), visits(x_1, x_3)$
$Q_{10}(x_1, x_2, x_3)$ :- $visits(x_1, x_2), visits(x_2, x_3)$
$Q_{11}(x_1, x_2, x_3)$ :- $serves(x_1, x_2), serves(x_1, x_3)$
$Q_{12}(x_1, x_2, x_3)$ :- $serves(x_1, x_2), serves(x_2, x_3)$
$Q_{13}(x_2, x_3, x_4)$ :- $likes(x_1, x_2), likes(x_3, x_4)$
$\vdots$
$Q_{37}(x_1, x_2, x_3)$ :- $serves(x_1, x_2), serves(x_3, x_4)$



Figure 3: Level 2. 



$Q_5$, since for each of them, there always exists a query in which it is contained
and which is not yet known to be frequent. For example, if we join $x_1$ and $x_3$
in query $Q_2$, resulting in $Q(x_1, x_2, x_3)$ :- $likes(x_1, x_2), visits(x_1, x_3)$,
then this query is contained in $Q'(x_1, x_2, x_4)$ :- $likes(x_1, x_2), visits(x_3, x_4)$,
of which the frequency is not yet known. Similar situations occur for the other
possible joins on $Q_2$, $Q_3$ and $Q_5$. The selection operation cannot be applied
to any of the queries either, since for each variable we would select, there always
exists a more general query in which that variable is projected, but not selected,
and hence, the frequency of such queries is not yet known. We can apply the
projection operation on any variable of queries $Q_1$ through $Q_6$, resulting in
queries $Q_{13}$ to $Q_{37}$.

Instead of showing the next levels for all possible queries, we will show only a
single path, starting from query $Q_7$. On this query, we can now also apply the
projection operation on $x_3$. This results in a redundant atom which can be
removed, resulting in the query $Q_7'(x_1, x_2)$ :- $likes(x_1, x_2)$. Again, for the
next level, we can use the projection operation on $x_2$, now resulting in
$Q''(x_1)$ :- $likes(x_1, x_2)$. Then, for the following level, we can use the
selection operation on $x_2$, resulting in the query
$Q'''(x_1)$ :- $likes(x_1, \text{'Duvel'})$. Note that if we had selected $x_2$ using
the constant 'Trappist', then the resulting query would not have been frequent
and would have been removed from further consideration.

If we repeatedly apply the four specialization operations until the levelwise
algorithm stops because no more candidate conjunctive queries can be generated,
the second phase can start generating confident association rules from all
generated frequent conjunctive queries. For example, starting from query $Q'''$,
we can apply the inverse selection operation, resulting in $Q''$. Since both these
queries have support 3, the rule $Q'' \Rightarrow Q'''$ holds with 100% confidence,
meaning that every drinker that likes a beer also likes Duvel, according to the
given database.
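
The support and confidence computations in this run can be replayed with a naive conjunctive query evaluator. The sketch below makes several assumptions: support is taken to be the number of distinct tuples in the query output, confidence of $Q' \Rightarrow Q$ is taken to be $\mathrm{supp}(Q)/\mathrm{supp}(Q')$, and the `likes` relation is a hypothetical six-tuple instance invented for illustration (it is not the instance of Figure 1, and the drinker names and the beer 'Orval' are made up). The helper names `evaluate` and `support` are likewise illustrative.

```python
from itertools import product

def is_const(term):
    # Assumed convention: constants are quoted, e.g. "'Duvel'"; all else is a variable.
    return term.startswith("'")

def matches(args, row, binding):
    """Extend `binding` so that the atom's arguments agree with `row`, if possible."""
    for term, value in zip(args, row):
        if is_const(term):
            if term.strip("'") != value:
                return False
        elif binding.setdefault(term, value) != value:
            return False  # variable already bound to a different value
    return True

def evaluate(head, body, db):
    """Naive evaluation: the set of head tuples of a conjunctive query over db."""
    answers = set()
    # Try one database tuple per body atom and keep the consistent combinations.
    for rows in product(*(db[pred] for pred, _ in body)):
        binding = {}
        if all(matches(args, row, binding)
               for (_, args), row in zip(body, rows)):
            answers.add(tuple(binding[v] for v in head))
    return answers

def support(head, body, db):
    return len(evaluate(head, body, db))

# Hypothetical instance, NOT the one in Figure 1.
db = {"likes": [("Allen", "Duvel"), ("Allen", "Trappist"), ("Bill", "Duvel"),
                ("Bill", "Orval"), ("Carol", "Duvel"), ("Carol", "Orval")]}

q1 = (("x1", "x2", "x3", "x4"),
      [("likes", ("x1", "x2")), ("likes", ("x3", "x4"))])
q_any = (("x1",), [("likes", ("x1", "x2"))])               # Q''
q_duvel = (("x1",), [("likes", ("x1", "'Duvel'"))])        # Q'''
q_trappist = (("x1",), [("likes", ("x1", "'Trappist'"))])

min_support = 2
confidence = support(*q_duvel, db) / support(*q_any, db)   # rule Q'' => Q'''
```

On this made-up instance, the level-1 query `q1` has support 6 x 6 = 36 because its two atoms share no variables, `q_any` and `q_duvel` both have support 3 so the rule has confidence 1.0, and `q_trappist` has support 1 and would be pruned at threshold 2.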

7 Conclusions and future research



In the future, we plan to study subclasses of conjunctive queries for which 
there exist efficient candidate generation algorithms up to equivalence. Possibly 
interesting classes are conjunctive queries on relational databases that consist of 
only binary relations. Indeed, every relational database can be decomposed into 
a database consisting of only binary relations. If necessary, this can be further 
simplified by only considering those conjunctive queries that can be represented 
by a tree. Note that one of the underlying challenges that always reappears is the 
equivalence test, which can be computed efficiently on tree structures. Other 
interesting subclasses are the class of acyclic conjunctive queries and queries
with bounded query-width, since for these structures too, equivalence testing
can be done efficiently [4].

However, by limiting the search space to one of these subclasses, Proposi- 
tion ^ is no longer valid, since the intersection of two queries within such a 
subclass does not necessarily result in a conjunctive query which is also in that
subclass. 

Another important topic is improving the performance of evaluating all
candidate queries. Allowing flexible constraints that efficiently limit the
search space to an interesting subset of all conjunctive queries is also an
important research topic.

References 

[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast 
discovery of association rules. In U.M. Fayyad, G. Piatetsky-Shapiro, 
P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery 
and Data Mining, pages 307-328. MIT Press, 1996. 

[2] J-F. Boulicaut, A. Bykowski, and C. Rigotti. Free-sets: a condensed repre- 
sentation of boolean data for frequency query approximation. Data Mining 
and Knowledge Discovery, 2001. To appear. 

[3] T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. In 
Proceedings of the 6th European Conference on Principles of Data Mining 
and Knowledge Discovery, Lecture Notes in Computer Science. Springer-
Verlag, 2002. To appear.

[4] C. Chekuri and A. Rajaraman. Conjunctive query containment revisited. 
Theoretical Computer Science, 239(2):211-229, 2000. 

[5] L. Dehaspe and H. Toivonen. Discovery of frequent datalog patterns. Data 
Mining and Knowledge Discovery, 3(1):7-36, 1999.

[6] L. Dehaspe and H. Toivonen. Discovery of relational association rules. In 
S. Dzeroski and N. Lavrac, editors, Relational data mining, pages 189-212. 
Springer-Verlag, 2001.






[7] S. Fortin. The graph isomorphism problem. Technical Report 96-20, Uni- 
versity of Alberta, Edmonton, Alberta, Canada, July 1996. 

[8] H. Garcia-Molina, J. Ullman, and J. Widom. Database System
Implementation. Prentice-Hall, 2000.

[9] D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT 
Press, 2001. 

[10] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for
mining frequent substructures from graph data. In Proceedings of the 4th
European Conference on Principles of Data Mining and Knowledge
Discovery, volume 1910 of Lecture Notes in Computer Science, pages 13-23.
Springer-Verlag, 2000.

[11] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proceed- 
ings of the 2001 IEEE International Conference on Data Mining, pages 
313-320. IEEE Computer Society, 2001. 

[12] H. Mannila and H. Toivonen. Levelwise search and borders of theories in 
knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241-258,
November 1997. 

[13] S.H. Nienhuys-Cheng and R. de Wolf. Foundations of Inductive Logic
Programming, volume 1228 of Lecture Notes in Artificial Intelligence.
Springer-Verlag, 1997.

[14] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent 
closed itemsets for association rules. In Proceedings of the 7th International 
Conference on Database Theory, volume 1540 of Lecture Notes in Computer 
Science, pages 398-416. Springer-Verlag, 1999.

[15] P. Roy, S. Seshadri, S. Sudarshan, and S. Bhobe. Efficient and extensible 
algorithms for multi query optimization. In Proceedings of the 2000 ACM
SIGMOD International Conference on Management of Data, volume 29:2 
of SIGMOD Record, pages 249-260. ACM Press, 2000. 

[16] J.D. Ullman. Principles of database and knowledge-base systems, volume 2, 
volume 14 of Principles of Computer Science. Computer Science Press, 
1989. 

[17] M. Zaki. Efficiently mining frequent trees in a forest. In Proceedings of the 
Eighth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining. ACM Press, 2002. To appear.






